Abstract. - When users rate objects, a sophisticated algorithm that takes into account ability
or reputation may produce a fairer or more accurate aggregation of ratings than the
straightforward arithmetic average. Recently a number of authors have proposed different
co-determination algorithms where estimates of user and object reputation are refined
iteratively together, permitting accurate measures of both to be derived directly from the
rating data. However, simulations demonstrating these methods' efficacy assumed a continuum
of rating values, consistent with typical physical modelling practice, whereas in most actual
rating systems only a limited range of discrete values (such as a 5-star system) is employed.
We perform a comparative test of several co-determination algorithms with different scales of
discrete ratings and show that this seemingly minor modification in fact has a significant
impact on algorithms' performance. Paradoxically, where rating resolution is low, increased
noise in users' ratings may even improve the overall performance of the system.



Introduction. - With the growth of the internet and e-commerce [1], an increasing number of
our social and commercial interactions are now one-shot exchanges with strangers identifiable
only by easily-replaced pseudonyms [2]. Similarly, most items on sale from e-commerce
websites must be purchased without an opportunity to try them first, creating an information
asymmetry that encourages the provision of low-quality goods [3]. To offset this risk of
fraud or deception, many online services implement reputation systems [4] that collect
ratings and feedback from users so as to provide a measure of trustworthiness for goods or
individuals.

A key challenge is how to aggregate this feedback effectively given that not all ratings are
equal. Some users' judgement may be poor or malicious: for example, many eBay users forgo
issuing deserved negative feedback to cheaters because the negative feedback they will
receive in reprisal will devastate their own carefully cultivated good reputation [5]. An
effective reputation system thus needs to distinguish between good and bad raters and
ratings.

One approach to this has been the development of co-determination algorithms of reputation,
where the aggregate reputation (or quality) of rated objects¹ is used to estimate a
corresponding reputation (or ability) for the system's users, and this latter measure is then
used to re-weight the aggregation of ratings for objects [6–8]. By iterating this procedure
over time, ratings from malicious or unskilled users can be weeded out, providing both a
better estimation of object quality and an enhanced overall reputation-based ranking of
objects.

Simulations to evaluate the effectiveness of these methods followed typical modelling
practices in physics and applied mathematics, assuming a continuum of rating values
(reflecting what may be presumed to be fine-grained shades of opinion). However, a
near-universal feature of real user feedback and rating systems is that they permit ratings
to take only a limited range of discrete values, most commonly the 5-star system employed by
Amazon, YouTube, etc. The influence of this constraint has never been tested on the
aforementioned algorithms, and the main purpose of the present letter is to explore how this
quantization of ratings affects the co-determination procedure and the resulting ranking and
reputation values.

1 We use 'object' simply as a generic term: the object of the rating. This might be an
actual object, such as a book or CD, or it might be a person or organization, such as an eBay
auctioneer, a website, or an Amazon Marketplace seller.

Our simulations show that if the number of available rating choices is too small, this has a
strong negative impact on the algorithms' performance. Paradoxically, in such circumstances,
having a community of users more prone to individual rating errors may actually increase the
overall performance of the system. We compare these results with psychometric research on the
measurement of attitudes, and discuss the implications for the construction of effective
online reputation, ranking and rating systems.

Algorithms. - The reputation and ranking algorithms explored in this paper all operate upon
the same basic type of data. Suppose we have a set $U$ of users who have each rated some
subset of the complete set $O$ of objects. For notational clarity we use Latin letters
($i$, $j$, ...) for user-related indices and Greek letters ($\alpha$, $\beta$, ...) for
object-related indices. The set of users who rated a given object $\alpha$ is denoted by
$U_\alpha$, while the set of objects rated by a user $i$ is denoted by $O_i$, and the value
of the rating of object $\alpha$ by user $i$ is denoted by $r_{i\alpha}$.

We assume that each object has an intrinsic quality $Q_\alpha$ from which the received
ratings differ to a greater or lesser degree depending on the ability of the user. While in
some online reputation systems there is an opportunity for users to 'rate the ratings',
providing an extra measure of user reputation, we do not rely on the availability of such
information: all the algorithms described here calculate user ability solely on the basis of
the rating data. On the basis of such measures of user ability we can then estimate object
quality using a weighted average of the ratings,

$$q_\alpha = \frac{\sum_{i \in U_\alpha} w_i r_{i\alpha}}{\sum_{i \in U_\alpha} w_i}, \qquad (1)$$

where the user weights $w_i$ are constructed by one of the following algorithms.

(i) Arithmetic average (AA). The baseline for comparison of reputation and ranking methods is
simply to treat all user ratings equally, setting $w_i = \mathrm{const}\ \forall i \in U$, or

$$q_\alpha = \frac{1}{|U_\alpha|} \sum_{i \in U_\alpha} r_{i\alpha}, \qquad (2)$$

which is of course the actual aggregation method used on most websites.
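
As a minimal sketch of this kind of aggregation, the weighted average over sparse data might be computed as below. The dictionary layout keyed by (user, object) pairs and the function name `aggregate_quality` are assumptions made here for illustration only. Passing no weights gives the plain arithmetic average.

```python
from collections import defaultdict


def aggregate_quality(ratings, weights=None):
    """Weighted-average quality estimate for each rated object.

    ratings: dict mapping (user, obj) -> rating value (sparse data).
    weights: dict mapping user -> non-negative weight. None means equal
             weights, i.e. the arithmetic average (AA).
    """
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for (user, obj), rating in ratings.items():
        w = 1.0 if weights is None else weights[user]
        numerator[obj] += w * rating
        denominator[obj] += w
    # Assumes every object has at least one rater with non-zero weight.
    return {obj: numerator[obj] / denominator[obj] for obj in numerator}
```

With three ratings such as `{("u1", "a"): 4, ("u2", "a"): 2, ("u1", "b"): 5}`, trusting `u1` three times as much as `u2` moves the estimate for object `a` from the unweighted mean towards `u1`'s rating.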

(ii) Mizzaro's algorithm (Mizz). Mizzaro has introduced a co-determination algorithm for the
assessment of scholarly articles, with reputation scores for authors, articles and readers
that co-evolve over time according to the ratings readers give to papers. The algorithm can
readily be applied to the more general user-object case we consider here, with author scores
omitted since their evolution is decoupled from the evolution of article and reader scores
and they are irrelevant in the present context. For consistency with the rest of the paper we
refer henceforth to objects and users instead of articles and readers.

The algorithm can be implemented in two versions, an incremental one where ratings are added
one by one and an iterative one that can be applied to a pre-existing dataset. We have
implemented both versions (which in any case, given the same data, produce the same result),
but for ease of comparison to the other algorithms we describe here the simpler, iterative
version.

Given a set of user weights $w_i$, object quality values $q_\alpha$ are calculated according
to Eq. (1). User weights $w_i$ are then recalculated according to Eq. (3), where the quantity
defined in Eq. (4) is a measure of the steadiness of object quality $q_\alpha$, and

$$g_{i\alpha} = 1 - \sqrt{|r_{i\alpha} - q_\alpha| / \Delta r} \qquad (5)$$

is a measure of agreement between the given rating and the object score (it equals 1 when the
two coincide). $\Delta r$ represents the width of the rating range, i.e. the difference
between the largest and smallest possible rating values, and this normalization guarantees
that the value of $g_{i\alpha}$ will fall within the range [0; 1].

The algorithm is initialized by setting equal weights $w_i = 1$ for all users $i$ and then
iterating repeatedly over Eqs. (1) and (3)–(5) until the change in the vector of quality
estimates between successive iteration steps falls below a certain threshold value² $\Delta$
(in our simulations, we use $\Delta = 10^{-4}$).

(iii) The Yu-Zhang-Laureti-Moret algorithm (YZLM). Yu and colleagues [7] have introduced an
algorithm that is essentially a generalized version of maximum likelihood estimation
(MLE) [9], using a control parameter $\beta \geq 0$ to determine how divergence from the
community consensus affects user weight $w_i$. Their own implementation considers only the
case where all users have rated all objects, but it is trivial to generalize it to operate on
sparse data.

Estimated object quality values $q_\alpha$ are again calculated according to Eq. (1). We then
calculate the divergence between the ratings of each user $i$ and the estimated object
quality values,

$$d_i = \frac{1}{|O_i|} \sum_{\alpha \in O_i} (r_{i\alpha} - q_\alpha)^2, \qquad (7)$$

and the updated weight of user $i$ is then given by

$$w_i = (d_i + \epsilon)^{-\beta}, \qquad (8)$$

where the exponent $\beta \geq 0$ determines the strength of the penalty applied to users with
larger rating divergence $d_i$ (note that $\beta = 0$ corresponds simply to the arithmetic
average) and $0 < \epsilon \ll 1$ is a small positive constant so as to prevent user weights
diverging (in our simulations, we use $\epsilon = 10^{-8}$). Yu et al. [7] noted that while
$\beta = 1/2$ provides better numerical stability of the algorithm as well as translational
and scale invariance, $\beta = 1$ is the optimal algorithm from the point of view of
mathematical statistics [10]. We have used $\beta = 1$ because it yields superior
performance, but choosing $\beta = 1/2$ does not alter the fundamental character of the
results obtained here.

2 Note that the algorithm may fail to converge if the threshold $\Delta$ is set too low [8].
Conversely, too large a threshold may disrupt the iterative process. It may therefore take a
few trials to choose an appropriate value.

The algorithm is initialized like Mizz, by setting the weights $w_i = 1$ for all users $i$
and then iterating repeatedly over Eqs. (1), (7) and (8) until the vector of quality
estimates $q$ changes by less than the threshold value $\Delta$.
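
The iteration described for YZLM can be sketched roughly as follows. This is an illustrative reading rather than the authors' implementation: it assumes that the divergence of a user is the mean squared difference between their ratings and the current quality estimates, and that convergence is judged by the root-mean-square change in the quality vector. The function name `yzlm`, the `max_iter` safeguard and the dictionary layout are likewise choices made only for this example.

```python
import math


def yzlm(ratings, beta=1.0, eps=1e-8, threshold=1e-4, max_iter=1000):
    """Iterate quality and weight estimates on sparse rating data.

    ratings: dict mapping (user, obj) -> rating value.
    Returns (quality, weights) as dicts keyed by object and by user.
    """
    users = {user for user, _ in ratings}
    weights = {user: 1.0 for user in users}  # equal starting weights
    quality = None
    for _ in range(max_iter):
        # Weighted average of ratings for each object.
        num, den = {}, {}
        for (user, obj), r in ratings.items():
            num[obj] = num.get(obj, 0.0) + weights[user] * r
            den[obj] = den.get(obj, 0.0) + weights[user]
        new_quality = {obj: num[obj] / den[obj] for obj in num}

        # Assumed divergence: mean squared deviation per user.
        sq, count = {}, {}
        for (user, obj), r in ratings.items():
            sq[user] = sq.get(user, 0.0) + (r - new_quality[obj]) ** 2
            count[user] = count.get(user, 0) + 1
        weights = {u: (sq[u] / count[u] + eps) ** (-beta) for u in users}

        # Assumed stopping rule: RMS change in the quality vector.
        if quality is not None:
            change = math.sqrt(
                sum((new_quality[o] - quality[o]) ** 2 for o in quality)
                / len(quality)
            )
            if change < threshold:
                quality = new_quality
                break
        quality = new_quality
    return quality, weights
```

Setting `beta=0` makes every weight equal to 1, so the sketch then reduces to the arithmetic average, in line with the remark in the text.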

(iv) de Kerchove and Van Dooren's algorithm (dKVD). De Kerchove and Van Dooren [8] have
introduced an algorithm similar to YZLM, but where the weight update function is given
instead by

$$w_i = 1 - k d_i, \qquad (9)$$

where $k$ is chosen such that $w_i > 0$. This has the advantage of guaranteeing convergence to
a unique solution independent of starting weights (though in practice this is not a
particular problem with any of the algorithms). In our simulations we adopt the strongest
possible punishment of noisy raters by setting $k = [\epsilon + \max_{j \in U} d_j]^{-1}$,
where $\epsilon$ ensures non-zero weights if the $d_j$ are all identical. The algorithm is
initialized in a similar manner to Mizz and YZLM by setting $w_i = 1$ for all users $i$ and
then iterating over Eqs. (1), (7) and (9) until the vector of quality estimates $q$ changes
by less than the threshold value $\Delta$.
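
The linear weight update with the strongest-punishment choice of k can be illustrated in a few lines. The function name `dkvd_weights` and the dictionary of per-user divergences are assumptions for this sketch only.

```python
def dkvd_weights(divergence, eps=1e-8):
    """Linear weight update w_i = 1 - k * d_i.

    divergence: dict mapping user -> non-negative divergence d_i.
    k is set to 1 / (eps + max d_j), so the noisiest user receives a
    weight just above zero and a user with d_i = 0 receives weight 1.
    """
    k = 1.0 / (eps + max(divergence.values()))
    return {user: 1.0 - k * d for user, d in divergence.items()}
```

Because of the small constant `eps`, all weights stay strictly positive even when every user has the same divergence.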

Artificial datasets. - To test the methods described above, we create artificial datasets in
the following way. For each object $\alpha$ we randomly generate a real-valued true quality
value $Q_\alpha$ from the uniform distribution³ $U[1; R]$, where $R$ is an integer $\geq 2$.
Similarly, for each user $i$ we randomly generate a personal error level $\sigma_i$ from the
distribution $U[\sigma_{\min}; \sigma_{\max}]$, where $\sigma_{\min}$ and $\sigma_{\max}$
scale with the width $\Delta r = R - 1$ of the rating scale. For a given sparsity of the
dataset $0 < \eta < 1$, we randomly select $\eta |U| |O|$ unique user-object pairs⁴
$(i, \alpha)$ and generate corresponding individual user estimates of object quality
according to

$$q_{i\alpha} = Q_\alpha + E_{i\alpha}, \qquad (10)$$

where the quality estimation error $E_{i\alpha}$ is drawn from the uniform distribution
$U[-\sigma_i; \sigma_i]$. The actual ratings are derived from these quality estimates
depending on the degree of quantization desired: for continuous-valued ratings we simply take
$r_{i\alpha} = q_{i\alpha}$, while discrete rating values are obtained by rounding to the
nearest integer, that is, $r_{i\alpha} = [q_{i\alpha}]$. In both cases, values lying outside
the prescribed range $[1; R]$ are truncated: those smaller than 1 are changed to 1 and those
greater than $R$ are changed to $R$. This follows the real-life constraint that, no matter
how much a user may adore or detest a particular object, they still cannot rate it higher or
lower than the given rating bounds. While changing $R$ does not produce a qualitative
difference in outcome for continuous-valued ratings, the constraint of discrete integer
values means that $R$ determines the resolution of rating precision, that is, the number of
distinct discrete rating values. Note that since we assume $\sigma_{\min}$ and
$\sigma_{\max}$ scale with $\Delta r = R - 1$, this is equivalent to increasing the resolution
by taking a higher number of equally-spaced discrete rating values within a fixed range:
increasing the width of the rating scale and taking integer values is simply easier to
implement.

3 It is possible to use non-uniform distributions, but given the limited rating scale this
makes little practical difference. A more pertinent question is whether there can actually be
such a thing as a 'true', objective quality value. The reasonableness of this assumption will
vary depending on what kind of objects are being considered, probably with particular
reference to whether an object will be assessed more on the basis of taste or functionality.

4 In the extremely rare case that an object or user ends up without any such links, we
discount them from further consideration, e.g. when assessing algorithms' performance.
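
The generation procedure can be sketched as below. Several details are assumptions made for this example rather than taken from the text: the number of rated pairs is taken to be the sparsity times the number of possible pairs (rounded), ties in rounding are resolved upwards, and the function name `make_dataset` and its arguments are hypothetical.

```python
import math
import random


def make_dataset(n_users, n_objects, R, sigma_min, sigma_max, eta,
                 discrete, seed=0):
    """Generate true qualities, user error levels and sparse ratings."""
    rng = random.Random(seed)
    quality = [rng.uniform(1, R) for _ in range(n_objects)]
    sigma = [rng.uniform(sigma_min, sigma_max) for _ in range(n_users)]

    # Assumed reading of sparsity: a fraction eta of all possible pairs.
    n_pairs = round(eta * n_users * n_objects)
    pair_ids = rng.sample(range(n_users * n_objects), n_pairs)

    ratings = {}
    for pair_id in pair_ids:
        user, obj = divmod(pair_id, n_objects)
        estimate = quality[obj] + rng.uniform(-sigma[user], sigma[user])
        if discrete:
            # Round to the nearest integer (ties rounded up here).
            estimate = math.floor(estimate + 0.5)
        # Truncate to the allowed rating range [1, R].
        ratings[(user, obj)] = min(max(estimate, 1), R)
    return quality, sigma, ratings
```

Calling the function twice with the same seed, once with `discrete=True` and once with `discrete=False`, gives discrete and continuous ratings built from the same underlying quality estimates, which is convenient for a like-for-like comparison.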

Performance metrics. - A simple test of algorithm performance is to compare the algorithm's
estimated quality values $q_\alpha$ with the 'true' quality values $Q_\alpha$ and calculate
the root-mean-square error [7],

$$\Delta Q = \sqrt{\frac{1}{|O|} \sum_{\alpha \in O} (q_\alpha - Q_\alpha)^2}, \qquad (11)$$

using the normalization $\Delta Q / (R - 1)$ to compare performance on datasets with
different rating resolution $R$. Since user weight is not expected to be equal to the true
user ability, we use Kendall's $\tau$ rank correlation coefficient [11] to compare the true
ability ranking of users according to $\sigma_i^{-1}$ with the estimated ranking given by
$w_i$. A result of $\tau = 1$ indicates that the true and estimated rankings are identical,
$-1$ that they are completely inverted, and 0 that the rankings are entirely uncorrelated.
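
Both measures are straightforward to compute. The sketch below uses a simple quadratic-time form of Kendall's coefficient that ignores tied pairs (often called tau-a). Which variant the simulations used is not stated, so this choice, like the function names, is an assumption.

```python
import math


def normalized_rmse(estimated, true, R):
    """Root-mean-square error between estimated and true quality,
    divided by the rating width R - 1."""
    objs = list(estimated)
    mse = sum((estimated[o] - true[o]) ** 2 for o in objs) / len(objs)
    return math.sqrt(mse) / (R - 1)


def kendall_tau(x, y):
    """Kendall rank correlation (tau-a) of two equal-length sequences."""
    n = len(x)
    score = 0
    for a in range(n):
        for b in range(a + 1, n):
            product = (x[a] - x[b]) * (y[a] - y[b])
            # +1 for a concordant pair, -1 for a discordant pair.
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)
```

To rank users, `x` would hold the inverse true error levels and `y` the estimated weights, listed in the same user order.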

While well defined for artificial numerical simulations, neither of these measures can easily
be applied to real data, where objective measures of object quality or user ranking are
difficult or impossible to obtain. In the absence of reliable per-item or per-user measures
of accuracy, an effective approach is to specify a group of 'relevant' objects or users and
inspect their position in the ranking [12]. To do this we employ the receiver operating
characteristic (ROC) curve [13], constructed by plotting for each place in the ranking a
point in $[0, 1]^2$ whose $x$, $y$ values correspond respectively to the proportion of
irrelevant and relevant objects recovered so far. The ranking accuracy can then be estimated
by the area under the curve (AUC), which equals 1 when every relevant object/user is ranked
higher than every irrelevant object/user, 0.5 when the distribution of relevant
objects/users is random, and 0 when every irrelevant object/user is ranked higher than every
relevant object/user. In the simulations presented here, we denote as 'relevant' the 5% of
objects/users with respectively the highest true quality values $Q_\alpha$ or lowest error
$\sigma_i$.

Fig. 1: (colour online) Overview of co-determination algorithm performance as the upper bound
$\sigma_{\max}$ of user error is varied, for continuous and discrete (integer) valued ratings
in the interval [1; 5], with $\sigma_{\min} = 0$, $|U| = 1000$, $|O| = 1000$, $\eta = 0.1$,
and results averaged over 100 realizations. (a-d) Accuracy in estimating object quality for
(a, b) continuous and (c, d) discrete ratings, measured by $\Delta Q$ and AUC. For comparison
to other figures, $\Delta Q$ is normalized with respect to the rating width
$\Delta r = R - 1 = 4$. (e-h) Accuracy in ranking user ability for (e, f) continuous and
(g, h) discrete ratings, measured by Kendall's $\tau$ rank correlation coefficient and AUC.
No results are shown for AA, as it does not rank users but considers them to all be equal.
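
The AUC of a ranking can also be computed without drawing the ROC curve, using the standard equivalence between the area under the curve and the fraction of (relevant, irrelevant) pairs in which the relevant item is ranked higher. The sketch below takes that route. The function name `ranking_auc` and the convention of counting ties as one half are assumptions for this example.

```python
def ranking_auc(scores, relevant):
    """Area under the ROC curve for a ranking by descending score.

    scores: dict mapping item -> score (higher means ranked earlier).
    relevant: set of items regarded as 'relevant'.
    """
    pos = [s for item, s in scores.items() if item in relevant]
    neg = [s for item, s in scores.items() if item not in relevant]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # relevant item correctly ranked higher
            elif p == n:
                wins += 0.5   # tie counted as half
    return wins / (len(pos) * len(neg))
```

For object ranking, `scores` would hold the estimated qualities and `relevant` the 5% of objects with the highest true quality. For users, `scores` would hold the weights and `relevant` the 5% of users with the lowest error level.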

Results. - For the results presented here we generated artificial datasets of 1000 users and
1000 objects, with sparsity $\eta = 0.1$. For each simulation we used the same datasets to
test each reputation algorithm. Our first simulations keep a constant rating resolution
$R = 5$ and a constant lower bound $\sigma_{\min} = 0$ for the distribution of users'
personal error levels, while the upper bound $\sigma_{\max}$ was varied in the range [0; 4].
This range was chosen so that, at its most extreme, the least skilled users (i.e. those with
$\sigma_i \approx \sigma_{\max}$) could potentially rate a 'perfect' 5-star object with the
lowest rating value 1, and vice versa.

Figure 1a,b,e,f presents the performance of the algorithms when we use continuous-valued
ratings, i.e. when $r_{i\alpha} = q_{i\alpha}$ exactly, and vary the upper error bound
$\sigma_{\max}$. We observe immediately that YZLM is by far the least sensitive to the
increasing error level, maintaining the lowest object quality error $\Delta Q$, the best user
ranking (Kendall's $\tau$), and the highest AUC ($\approx 1$) for both objects and users.
This is because, of all the methods, YZLM places the harshest sanction against 'noisy' raters
who diverge from the aggregate estimated quality, a feature that can be observed in the
ranking of users, where we observe near-identical values of $\tau$ for both YZLM and dKVD
(their weights and hence ranking stem from the same measure $d_i$ of user rating divergence)
but consistently higher AUC for YZLM (the very best users are more consistently pushed to the
top of the ranking due to the harsher sanction). The superiority of YZLM is maintained across
different sizes of dataset and different data sparsity values, and is found to be dependent
primarily on $\sigma_{\min}$: if this lower error bound is increased, results from all four
algorithms become similar as, in the absence of objectively good raters, there is much less
advantage to be had in discriminating between better and worse⁵.

To assess the difference between continuous- and discrete-valued ratings, we took the same
sets of artificial data and repeated the analysis with ratings now constrained to integer
values (1-5). As shown in Fig. 1c,d,g,h, this quantization has a substantial negative effect
on performance, with $\sigma_{\max} = 0$ in particular being disastrous for all reputation
algorithms. As $\sigma_{\max}$ increases, $\Delta Q$, $\tau$ and object and user AUC improve
and then, in some cases, $\Delta Q$ and object AUC worsen again. We also notice that the
relative performance of the methods with respect to $\Delta Q$ and object AUC is inverted for
$\sigma_{\max} < 2$, with YZLM the worst-performing of the algorithms, regaining its
superiority only when the upper error bound is large.

The apparent paradox of better performance resulting from increasing error can be explained
as follows. Imagine an object with 'true' quality 3.4 being assessed by two distinct groups
of users, the first whose quality assessment is always error-free ($\sigma_i = 0$), the
second whose error levels are set at $\sigma_i = 0.5$ (i.e. the average error level of a user
from a group with error levels $\sigma_i$ drawn from $U[0; 1]$). Users from the first group
will of course make correct quality judgements $q_{i\alpha} = Q_\alpha$, but the discrete
rating system forces them to adopt the nearest integer value of 3. The resulting average
(also 3) will thus differ from the true quality by 0.4. By contrast the 'noisy' users'
quality estimates will be distributed uniformly in the range $Q_\alpha \pm 0.5$ and so on
average 60% of them will give a discrete rating of 3 and 40% will give 4, leaving an average
of 3.4, that is, on average a perfect match to the original quality value. Effectively, the
constraint of discrete ratings produces a systematic quantization error, which 'noisy' users
can offset in the same way that dither can reduce quantization error in signal
processing [14].

5 The degree of superiority shown by YZLM actually depends both on the value of
$\sigma_{\min}$ and on the difference $\sigma_{\max} - \sigma_{\min}$. We do not provide a
detailed illustration of this for reasons of space, but the effect can be observed in the
differences in asymptotic values of AUC between the upper and lower panels of Fig. 2.

Fig. 2: (colour online) The dependency of algorithm performance on the discrete rating
resolution $R$, measured by (a, b) $\Delta Q$ for objects and (c, d) AUC for users. The upper
error bound $\sigma_{\max} = R - 1$ covers the full rating range, while lower error bounds
are (a, c) $\sigma_{\min} = 0$, (b, d) $\sigma_{\min} = \sigma_{\max}/8$. Other parameter
values are as in Fig. 1.
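
The worked example of an object with true quality 3.4 can be checked numerically. The short simulation below is only an illustration of the dither argument: the function name `mean_discrete_rating`, the sample size and the seed are arbitrary choices, and ties in rounding are resolved upwards.

```python
import math
import random


def mean_discrete_rating(true_quality, sigma, n_users=200_000, seed=0):
    """Average integer rating given by users whose quality estimates are
    uniformly distributed within true_quality +/- sigma."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_users):
        estimate = true_quality + rng.uniform(-sigma, sigma)
        total += math.floor(estimate + 0.5)  # round to nearest integer
    return total / n_users


error_free = mean_discrete_rating(3.4, 0.0)  # every user rates exactly 3
noisy = mean_discrete_rating(3.4, 0.5)       # mix of 3s and 4s
```

Error-free users all give the rating 3, so their average is off by 0.4, whereas the average of the noisy group lies close to the true value of 3.4.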

A slightly more subtle argument is needed to explain the bad performance of YZLM when faced
with any but the largest levels of error. Here we note that, while the aggregate error of
low-$\sigma_i$ agents may be greater, their individual error will still on average be less.
YZLM, with its strong bias towards users with low observed error rates, will thus favour
these users, suppressing noisy agents and consequently harming aggregate performance. This is
confirmed by Fig. 1g,h, where we observe that while overall accuracy of user ranking
(Kendall's $\tau$) suffers with low $\sigma_{\max}$, the high AUC values indicate that the
lowest-$\sigma_i$ users are still being pushed towards the top of the ranking. As
$\sigma_{\max}$ increases, aggregate error of the wider population grows and YZLM's
suppression of high individual error rates acts to suppress this, sustaining its performance
while the other algorithms suffer.

To better understand the effects of changing the rating resolution, we performed simulations
where user error was fixed in proportion to the width of the rating scale, and varied the
value of $R$ while taking discrete ratings. Fig. 2 shows the results for two sets of
simulations, the first with $\sigma_{\min} = 0$, the second with
$\sigma_{\min} = \sigma_{\max}/8$, using $\Delta Q/(R - 1)$ and AUC to measure performance in
assessing objects (a, b) and users (c, d) respectively. In both cases
$\sigma_{\max} = R - 1$, so that the maximum possible user error covers the full range of the
rating scale.

As we increase the rating resolution $R$, we observe a gradual approach to asymptotic values
of object ($\Delta Q$) and user (AUC) performance comparable to those obtained with
continuous-valued ratings. Similar to Fig. 1c, there is a marked difference between YZLM and
the other algorithms with respect to $\Delta Q$. Whereas AA, Mizz and dKVD have only a
limited response to increasing resolution, YZLM is able to reap a significant benefit, with
its performance sustaining continuous improvement even as $R$ approaches 20. The reason is
made clear by Fig. 2c, where we observe that, unlike the other algorithms, increasing $R$
permits YZLM to push the lowest-$\sigma_i$ users consistently to the very top of the ranking
(AUC $\to 1$).

YZLM's dependency on low-$\sigma_i$ raters is further emphasized by Fig. 2b, where the
performance of AA, Mizz and dKVD is little affected by the higher value of $\sigma_{\min}$
but where now YZLM performs better for binary ratings (again, the 'increased noise = better
performance' paradox) while no longer sustaining any significant improvements in $\Delta Q$
for $R > 3$. When observing AUC for user ranking (Fig. 2d), we observe that now YZLM too is
unable to consistently push the lowest-$\sigma_i$ users to the top, with asymptotic AUC
values now near-identical to dKVD.

Discussion. - Psychometric research has put considerable effort into understanding the
effectiveness and reliability of different rating scales, particularly with respect to the
scale resolution [15–18]. Factors to take into account include both the information-carrying
capacity of the scale and the information-processing capacity of respondents [19], as well as
psychological influences such as the descriptive labels associated with responses [20].

The relevance of these factors depends on exactly what kind of information one wants to
extract from the scale. If the aim is to aggregate or average over respondents, three or even
two discrete response options may suffice [16]. Conversely, if the focus is on individual
difference, finer-grained scales become necessary [17].

Co-determination algorithms are prima facie aggregation mechanisms, but they also employ
measures of individual difference to improve the aggregation process [6–8]. The effect of
rating resolution on their performance will therefore depend on several factors, including
the degree to which there are meaningful and reliable differences in user rating ability,
whether the scale is fine-grained enough to accurately reflect those differences, and the
algorithm's ability to measure and exploit this information if it exists.

In this letter we have investigated the influence of low rating resolution on the performance
of several co-determination reputation and ranking algorithms. While the presence of a
non-zero optimal noise level (Fig. 1c,d) may be seen as a mere mathematical curiosity (in
effect an example of quantization error being reduced by the application of dither [14]), the
worsened performance of these methods is an important finding. Psychometric studies have in
general suggested that there is little benefit to be had from using more than 7 discrete
rating categories [15]. Our results suggest that in fact this may prevent the maximum
exploitation of rating data, precluding the fine-grained observation of individual difference
necessary to improve the aggregation process. A comparison can be drawn to models of opinion
dynamics inspired by the Potts model, where if the number $q$ of spin/opinion values is too
small, opinions become homogenized across the population, while as $q \to \infty$, diverse
regions of different opinion can be preserved [21].

We have also shown that, where the rating resolution is high enough, co-determination
algorithms, particularly YZLM, are able to achieve significantly better results than a mere
arithmetic average. Given that psychometric studies have not shown any major disadvantages of
using higher-resolution scales [18], it may thus be preferable for modern rating and
reputation systems to employ continuous-valued scales such as the graphic rating scale or the
visual analogue scale [22]. In an online world such scales can be implemented easily through
the use of percentage scores or slider bars [23]. Empirical studies employing these and other
rating methods should be able to determine if and when respondents are in practice able to
achieve the required precision of judgement, and so help to identify the situations where a
sophisticated method may yield superior performance.